STA6235: Modeling in Regression
\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
candidatevotes: votes received by this candidate for this particular party
totalvotes: total number of votes cast for this election
year: year in which election was held
region: region that state belongs to, as defined by the US Census Bureau
library(tidyverse)
library(fastDummies)
house <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-07/house.csv') %>%
filter(stage == "GEN" &
party %in% c("REPUBLICAN", "DEMOCRAT") &
state != "DISTRICT OF COLUMBIA") %>%
mutate(candidatevotes = candidatevotes + 1,
year10 = year/10,
region = if_else(state %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT", "NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Northeast",
if_else(state %in% c("ILLINOIS", "INDIANA", "MICHIGAN", "OHIO", "WISCONSIN", "IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "Midwest",
if_else(state %in% c("DELAWARE", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA","WEST VIRGINIA", "ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE", "ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "South", "West")))) %>%
dummy_cols(select_columns = c("region"))f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!},
where k is the count (k \in \mathbb{Z}^+).
A requirement of this distribution is that the mean (\lambda) equals the variance (\lambda).
To check this, we just find the mean and variance of the count outcome we are looking at.
\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
What is different? The underlying distribution.
We now are applying the negative binomial distribution:
f(y; r, p) = {y + r -1 \choose y-1} (1-p)^y p^r
Why does the negative binomial handle overdispersed data better?
The variance is as follows,
\text{var}[Y] = \mu + \frac{\mu^2}{k},
Here, k is an additional parameter, called the dispersion parameter.
As k \to \infty, \text{nb}(r, p) \to \text{Poi}(\lambda).
What does this mean for us?
Why do we care if the data is overdispersed?
When data is overdispersed, we know that the standard error is underestimated.
Let’s think about how test statistics are created:
z_i = \frac{\hat{\beta}_i}{\text{SE}_{\hat{\beta}_i}}
As the standard error reduces, our test statistic increases.
As our test statistic increases, our p-value decreases.
We are inflating the type I error rate when applying the Poisson to overdispersed data!
We now will use the glm.nb() function from the MASS package when specifying the negative binomial distribution.
!! WARNING !! the MASS package will overwrite the select() function from dplyr (tidyverse).
I prefer to call it using MASS::glm.nb() rather than calling the MASS package into R.
Note that we do not need to specify the distribution inside the function!
Let’s reproduce the example from the Poisson lecture.
We will use the election data to model the number of votes received as a function of the year, the region, the interaction between the party and the year, and the interaction between the party and region.
Call:
MASS::glm.nb(formula = candidatevotes ~ year10 + party + region_Midwest +
region_Northeast + region_West + party:year10 + party:region_Midwest +
party:region_Northeast + party:region_West + totalvotes,
data = house, init.theta = 3.044428116, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.611e+01 9.049e-01 17.801 < 2e-16 ***
year10 -2.807e-02 4.549e-03 -6.172 6.76e-10 ***
partyREPUBLICAN -6.225e+00 1.184e+00 -5.260 1.44e-07 ***
region_Midwest 3.022e-02 1.567e-02 1.928 0.0538 .
region_Northeast 9.109e-02 1.619e-02 5.625 1.86e-08 ***
region_West 7.021e-02 1.580e-02 4.444 8.83e-06 ***
totalvotes 4.770e-06 6.079e-08 78.466 < 2e-16 ***
year10:partyREPUBLICAN 3.127e-02 5.916e-03 5.286 1.25e-07 ***
partyREPUBLICAN:region_Midwest -3.984e-02 2.191e-02 -1.818 0.0690 .
partyREPUBLICAN:region_Northeast -2.792e-01 2.318e-02 -12.044 < 2e-16 ***
partyREPUBLICAN:region_West -1.634e-01 2.250e-02 -7.261 3.83e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(3.0444) family taken to be 1)
Null deviance: 28139 on 19566 degrees of freedom
Residual deviance: 20959 on 19556 degrees of freedom
AIC: 479769
Number of Fisher Scoring iterations: 1
Theta: 3.0444
Std. Err.: 0.0298
2 x log-likelihood: -479745.1550
\begin{align*} \ln(\hat{y}) =& 16.11 - 0.03 \text{ year} - 6.22 \text{ repub.} + 0.03 \text{ midwest} + 0.09 \text{ northeast} + 0.07 \text{ west} \\ & + 0.03 (\text{repub. $\times$ year}) \\ & - 0.04 (\text{repub. $\times$ midwest}) - 0.28 (\text{repub. $\times$ northeast}) - 0.16 (\text{repub. $\times$ west}) \end{align*}
confint() function. 2.5 % 97.5 %
(Intercept) 1.431317e+01 1.790104e+01
year10 -3.710638e-02 -1.904309e-02
partyREPUBLICAN -8.566580e+00 -3.883095e+00
region_Midwest -4.307423e-04 6.091404e-02
region_Northeast 5.945774e-02 1.227981e-01
region_West 3.925256e-02 1.012285e-01
totalvotes 4.631954e-06 4.907906e-06
year10:partyREPUBLICAN 1.956598e-02 4.297417e-02
partyREPUBLICAN:region_Midwest -8.282722e-02 3.149976e-03
partyREPUBLICAN:region_Northeast -3.247801e-01 -2.336813e-01
partyREPUBLICAN:region_West -2.075389e-01 -1.191787e-01
We take the same approach to determining significance as before.
A single term in the model \to summary().
Multiple terms in the model \to full/reduced anova().
Call:
MASS::glm.nb(formula = candidatevotes ~ year10 + party + region_Midwest +
region_Northeast + region_West + party:year10 + party:region_Midwest +
party:region_Northeast + party:region_West + totalvotes,
data = house, init.theta = 3.044428116, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.611e+01 9.049e-01 17.801 < 2e-16 ***
year10 -2.807e-02 4.549e-03 -6.172 6.76e-10 ***
partyREPUBLICAN -6.225e+00 1.184e+00 -5.260 1.44e-07 ***
region_Midwest 3.022e-02 1.567e-02 1.928 0.0538 .
region_Northeast 9.109e-02 1.619e-02 5.625 1.86e-08 ***
region_West 7.021e-02 1.580e-02 4.444 8.83e-06 ***
totalvotes 4.770e-06 6.079e-08 78.466 < 2e-16 ***
year10:partyREPUBLICAN 3.127e-02 5.916e-03 5.286 1.25e-07 ***
partyREPUBLICAN:region_Midwest -3.984e-02 2.191e-02 -1.818 0.0690 .
partyREPUBLICAN:region_Northeast -2.792e-01 2.318e-02 -12.044 < 2e-16 ***
partyREPUBLICAN:region_West -1.634e-01 2.250e-02 -7.261 3.83e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(3.0444) family taken to be 1)
Null deviance: 28139 on 19566 degrees of freedom
Residual deviance: 20959 on 19556 degrees of freedom
AIC: 479769
Number of Fisher Scoring iterations: 1
Theta: 3.0444
Std. Err.: 0.0298
2 x log-likelihood: -479745.1550
f <- MASS::glm.nb(candidatevotes ~ year10 + party + region_Midwest + region_Northeast + region_West + party:year10 + party:region_Midwest + party:region_Northeast + party:region_West + totalvotes, data = house)
r <- MASS::glm.nb(candidatevotes ~ year10 + party + region_Midwest + region_Northeast + region_West + party:year10 + totalvotes, data = house)
anova(r, f, test = "Chisq")\begin{align*} \ln(\hat{y}) =& -2.62 + 0.07 \text{ year} - 4.40 \text{ repub.} + 0.11 \text{ midwest} + 0.12 \text{ northeast} + 0.11 \text{ west} \\ & + 0.02 (\text{repub. $\times$ year}) \\ & - 0.06 (\text{repub. $\times$ midwest}) - 0.31 (\text{repub. $\times$ northeast}) - 0.19 (\text{repub. $\times$ west}) \end{align*}
\begin{align*} \ln(\hat{y}) =& 16.11 - 0.03 \text{ year} - 6.22 \text{ repub.} + 0.03 \text{ midwest} + 0.09 \text{ northeast} + 0.07 \text{ west} \\ & + 0.03 (\text{repub. $\times$ year}) \\ & - 0.04 (\text{repub. $\times$ midwest}) - 0.28 (\text{repub. $\times$ northeast}) - 0.16 (\text{repub. $\times$ west}) \end{align*}
We now have learned modeling for count data.
Mean = variance \to Poisson.
Mean >>> variance \to negative binomial.
Note that there is a method to handle excessive zeros in the model: zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB).
In today’s assignment, you will remove the outlier, reanalyze the data, and compare the results.